The smallest extraction problem

Abstract

We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large and templated websites, therefore effectively tackling Web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of Web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar that best describes a set of pages and contextually extracts their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common template, and an automatic Web data extraction system. The experiments on consolidated benchmarks show that the approach can substantially contribute to improving the state of the art.

Similar resources

The Generalized Smallest Grammar Problem

The Smallest Grammar Problem – the problem of finding the smallest context-free grammar that generates exactly one given sequence – has never been successfully applied to grammatical inference. We investigate the reasons and propose an extended formulation that seeks to minimize non-recursive grammars, instead of straight-line programs. In addition, we provide very efficient algorithms that app...

The Smallest Grammar Problem Revisited

In a seminal paper of Charikar et al. on the smallest grammar problem, the authors derive upper and lower bounds on the approximation ratios for several grammar-based compressors, but in all cases there is a gap between the lower and upper bound. Here we close the gaps for LZ78 and BISECTION by showing that the approximation ratio of LZ78 is Θ((n/log n)^{2/3}), whereas the approximation ratio of BISE...

The Problem of Divine Hiddenness

This thesis addresses the problem of divine hiddenness and the difficulties of the argument based on it. The problem of divine hiddenness is as old as religion itself and takes on particular importance in the Abrahamic religions. In these religions, given God's transcendence and, at the same time, His role as creator, His presence, and His speaking to and intuitive communication with some human beings living on earth, a problem arises with questions such as why direct communication, or at least communication sufficient for ...

Inapproximability of the Smallest Superpolyomino Problem

We consider the smallest superpolyomino problem: given a set of colored polyominoes, find the smallest polyomino containing each input polyomino as a subshape. This problem is shown to be NP-hard, even when restricted to a set of polyominoes using a single common color. Moreover, for sets of polyominoes using two or more colors, the problem is shown to be NP-hard to approximate within a O(n^{1/3−...

Journal

Journal title: Proceedings of the VLDB Endowment

Year: 2021

ISSN: 2150-8097

DOI: https://doi.org/10.14778/3476249.3476293